Abstract

Background: Mass spectrometry (MS) is a very sensitive and specific method for protein identification, biomarker
discovery, and biomarker validation. Protein identification is commonly carried out by comparing MS data with
public databases. However, with the development of high-throughput and accurate genomic sequencing
technology, public databases are being overwhelmed with new entries from different species every day. The
application of these databases can also be problematic due to factors such as size, specificity, and unharmonized
annotation of the molecules of interest. Current database searches based on liquid chromatography-tandem mass
spectrometry (LC-MS/MS) focus on enzyme digestion patterns and sequence information; consequently,
important functional information can be missed in the search output. Protein variants with high sequence
homology can interfere with database identification when only certain homologues are examined.
In addition, recombinant DNA technology can result in products that may not be accurately annotated in public
databases. Curated databases, which focus on the molecule of interest with clearer functional annotation and
sequence information, are necessary for accurate protein identification and validation. Here, four cases of curated
database application are explored and summarized.

Findings: The four curated databases presented were constructed with clear goals regarding application and have
proven very useful for targeted protein identification and biomarker application in different fields. They include a
sheeppox virus database created for accurate identification of proteins with strong antigenicity, a custom database
containing clearly annotated protein variants such as tau transcript variant 2 for accurate biomarker identification, a
sheep-hamster chimeric prion protein (PrP) database constructed for assay development for prion diseases, and a custom
Escherichia coli (E. coli) flagella (H antigen) database produced for MS-H, a new H-typing technique. Clearly annotating the
proteins of interest was essential for highly accurate, specific, and sensitive sequence identification. Searching against
public databases alone resulted in inaccurate identification of the sequence of interest, while combining the curated database
with a public database reduced both the confidence and the sequence coverage of the protein search.

Conclusion: Curated protein sequence databases incorporating clear annotations are very useful for accurate protein 
identification and fit-for-purpose application through MS-based biomarker validation. 

Keywords: Curated database, Targeted protein identification, Sheeppox virus, Flagellar typing, Tau, Recombinant prion
protein
Findings

The maturity of modern genomic sequencing technology has seen genomic databases
generated for more and more species, and public databases grow larger every day.
Owing to advanced instrumentation and powerful search engines, this mounting
comprehensiveness and the refinement of databases have benefited mass spectrometry
(MS)-based protein identification and biomarker discovery. However, despite
improvement in these areas, MS-based protein characterization using public databases
has not yet been perfected for all species. For instance, annotation of individual
genes and their related protein products has not been standardized. Because
sequence-focused protein identification by MS is primarily based on peptides produced
by proteolytic enzyme digestion, much important annotation information, including the
functions of proteins, can be ignored by the applied search engine [1]. It has been
shown that search results can be optimized by using custom databases that focus on
protein function with clear annotation, such as those generated with programs like
"Database on Demand" [1,2]. It has also been reported that search algorithms lose
sensitivity when the search space (i.e. database size) is increased [3], and that the
more similar the database sequence is to that of the protein of interest, the more
accurate the search result [4]. These points are especially important during
biomarker discovery and validation, as well as for protein identification in
"non-mainstream" organisms [5]. Many custom protein databases have already been
created to meet the special circumstances of the examined molecule, including
databases for prokaryotic ubiquitin-like protein (Pup) [6] and O-GlcNAcylated
proteins [7], and a biomolecular interaction network database [8].
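The loss of sensitivity with a growing search space [3] can be illustrated with a simple significance calculation. The sketch below is not the Mascot algorithm itself; it assumes a probability-based score of the form -10*log10(P) and a Bonferroni-style correction, and the candidate counts are hypothetical values chosen only for illustration.

```python
import math


def identity_threshold(n_candidates: int, alpha: float = 0.05) -> float:
    """Score a peptide match must exceed to be significant at level `alpha`
    when `n_candidates` peptides were compared with the spectrum.

    Assumes score = -10 * log10(probability that the match is random)
    and a Bonferroni-style correction for the number of candidates.
    """
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    return -10.0 * math.log10(alpha / n_candidates)


# Hypothetical candidate counts for a small curated database and a
# large public database (illustrative values, not measured ones).
for label, n in [("curated", 10), ("public", 10_000)]:
    print(f"{label:8s} candidates={n:6d} threshold={identity_threshold(n):.1f}")
```

Under these assumptions, every tenfold increase in the number of candidate peptides raises the required score by 10 units, so a peptide match of fixed quality can fall below significance simply because the database grew.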



In this paper, four projects spanning six years at the National Microbiology
Laboratory in Canada, involving curated database creation and application for the
purpose of biomarker identification and validation, are presented. All MS-based
protein identification was performed using liquid chromatography-tandem mass
spectrometry (LC-MS/MS) detection and the Mascot database search algorithm. All the
curated databases are provided in FASTA file format in Additional file 1. The
detected proteins of interest are shown in Table 1.
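As an illustration of what a clearly annotated curated database entry can look like, the following minimal sketch writes FASTA text in which each header carries an explicit annotation. The function name `format_fasta`, the identifiers, the annotations, and the short sequences are all invented placeholders and are not taken from Additional file 1.

```python
from typing import Iterable, Tuple


def format_fasta(entries: Iterable[Tuple[str, str, str]], width: int = 60) -> str:
    """Return FASTA text for (identifier, annotation, sequence) entries.

    Whitespace inside sequences is removed and residues are upper-cased;
    sequence lines are wrapped at `width` characters.
    """
    lines = []
    for identifier, annotation, sequence in entries:
        seq = "".join(sequence.split()).upper()
        if not seq:
            raise ValueError(f"empty sequence for {identifier}")
        lines.append(f">{identifier} {annotation}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Placeholder entries (hypothetical identifiers, annotations, sequences).
example_entries = [
    ("CD_0001", "example flagellin, serotype annotated explicitly", "MSTAKLLVEGTTDEEKGASPR"),
    ("CD_0002", "example chimeric protein, parent 1 + parent 2", "MTAKLSDEGVINQSEK"),
]
print(format_fasta(example_entries), end="")
```

Keeping the annotation in the header line means that the description reported with a database hit states the variant, serotype, or construct directly.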

The first project involved analyzing two SDS-PAGE (sodium dodecyl sulfate
polyacrylamide gel electrophoresis) protein bands derived from sheeppox virus [9]. A
western blot demonstrated that one protein band ("band A") was immunologically very
reactive to serum from sheep infected with the virus and, if identified, could have
implications for vaccine design and/or reagent development for viral diagnosis.
In-gel digestion was performed on this band, and the extracted tryptic peptides were
separated and detected by LC-MS/MS. Mascot (Matrix Science) was used to perform the
database search. When searching the public database MSDB (Mass Spectrometry Sequence
Database; 3,229,079 sequences; created by the Proteomics Group at Imperial College
London), the top hit was "putative virion core protein-lumpy skin disease virus",
with a Mascot score of 859 and 51 matched peptides. When searching the curated
poxvirus-specific database (21,000 sequences), created from the PBR (Poxvirus
Bioinformatics Resource Centre) website (http://www.poxvirus.org/index.asp?bhcp=1),
a more accurate identification was obtained (i.e. the "sheeppox virus protein") with
higher confidence (Mascot score = 1039) based on 80 peptide matches (Additional
files 2 and 3). This observation clearly demonstrates that a smaller but more focused
database is very useful for confirmation and validation of the molecule under study.

Table 1 Search output produced by searching MS sequence data of various peptides against curated databases (CD)
and the public databases, MSDB, NCBInr, and PBR

Project  Sample source            Sample preparation  Targeted protein                     Database: top hit                       Score  Peptide number
1*       Sheeppox virus           SDS-PAGE gel band   Unknown band (104 kD)                MSDB: lumpy disease virus protein       859    51
1*       Sheeppox virus           SDS-PAGE gel band   Unknown band (104 kD)                PBR: sheeppox virus protein             1039   80
2        Human                    In-solution digest  tau, transcript variant 2 (4027 kD)  NCBInr: PNS specific tau, 78.8 kD       465    29 (17)‡
2        Human                    In-solution digest  tau, transcript variant 2 (4027 kD)  CD: tau, transcript variant 2, 4027 kD  1615   34 (27)
3        Sheep-hamster (chimera)  SDS-PAGE gel band   Sheep-hamster chimeric PrP           NCBInr: PrP in Dpc Micelles             4987   1 (1)
3        Sheep-hamster (chimera)  SDS-PAGE gel band   Sheep-hamster chimeric PrP           CD: sheep-hamster chimeric PrP          3857   9 (8)
4        E. coli                  In-solution digest  Flagellin H37                        NCBInr: bacterial flagellin (E. coli)   18862  31 (26)
4        E. coli                  In-solution digest  Flagellin H37                        CD: H37, gi|30059966|                   29742  33 (31)

* A QSTAR system was used to test the samples, and the Mascot database search was performed with 0.4 Da peptide mass tolerance, 0.4 Da MS/MS tolerance, two missed tryptic
cleavages, possible methionine oxidation, and all cysteine residues as carboxamidomethyl-cysteine due to alkylation with iodoacetamide.

† An Orbitrap system was used with 30 ppm peptide mass tolerance, 0.5 Da MS/MS tolerance, and two missed tryptic cleavages for all database searches. Oxidation
on methionine and deamidation on glutamine and asparagine were chosen as possible modifications.

‡ Numbers without brackets denote total specific peptide match numbers, while numbers in brackets denote significant specific peptide match numbers as per
the Mascot search engine.
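Because the identifications above rest on matching tryptic peptides, and the searches allowed for missed tryptic cleavages, it may help to see how a theoretical peptide list is derived from a database sequence. The following is a minimal sketch, assuming the commonly used trypsin rule (cleave after K or R, but not before P); the function name and the toy sequence are hypothetical, and search engines apply further filters (mass range, modifications) that are omitted here.

```python
from typing import List


def tryptic_peptides(sequence: str, missed_cleavages: int = 2) -> List[str]:
    """In-silico tryptic digest allowing up to `missed_cleavages`.

    Rule assumed: cleave C-terminal to K or R unless the next residue is P.
    """
    # Step 1: fully cleaved fragments.
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])

    # Step 2: join each fragment with up to `missed_cleavages` neighbours.
    peptides = []
    for first in range(len(fragments)):
        last_allowed = min(first + missed_cleavages + 1, len(fragments))
        for last in range(first + 1, last_allowed + 1):
            peptides.append("".join(fragments[first:last]))
    return peptides


# Toy sequence (invented); the R followed by P is not cleaved.
print(tryptic_peptides("AKRPGKC", missed_cleavages=1))
```

The number of theoretical peptides grows with both the number of database sequences and the number of missed cleavages allowed, which is one way the search space expands.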

The second project employed MS to detect a protein with transcript variants.
Microtubule-associated protein tau (or simply "tau") has several variant forms
[10,11]; examined in this study was tau transcript variant 2 (tau-2, GenBank
accession NM_005910), routinely used in our laboratory as a biomarker for prion
disease diagnosis [12]. When tau-2 MS data were searched against the public database
NCBInr (National Center for Biotechnology Information Non-Redundant), the
"peripheral nervous system (PNS) specific tau" protein was primarily identified
(Table 1, Additional file 4), when in fact tau-2 is a central nervous system tau
variant. Moreover, the top hits obtained from in-gel and in-solution digests
represented different variants of the same protein. These inconsistencies made
quality control assessment of the MS data difficult; consequently, a curated
database with clear annotations was used to perform the search, and a consistent
result was obtained (Table 1, Additional file 5).
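The difficulty with closely related variants can be made concrete by asking which tryptic peptides are unique to each variant and which are shared. The sketch below uses two invented toy sequences (they are not tau sequences) that differ by one internal insertion; the function names are hypothetical, and the digest assumes full cleavage after K or R except before P.

```python
from typing import Dict, List, Set


def digest(sequence: str) -> List[str]:
    """Fully cleaved tryptic peptides (after K/R, not before P)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def unique_peptides(variants: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each variant name to the peptides found in no other variant."""
    digests = {name: set(digest(seq)) for name, seq in variants.items()}
    unique = {}
    for name, peptides in digests.items():
        others = set().union(*(p for n, p in digests.items() if n != name))
        unique[name] = peptides - others
    return unique


# Invented toy variants: variant_B lacks the internal segment "TTDEEK".
toy_variants = {
    "variant_A": "MSTAKLLVEGTTDEEKGASPR",
    "variant_B": "MSTAKLLVEGGASPR",
}
print(unique_peptides(toy_variants))
```

Only the variant-specific peptides discriminate between the entries; a hit supported solely by shared peptides (here `MSTAK`) fits either variant equally well, which is how different variants can surface as the top hit in different runs.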

In the third project, a curated database was employed to detect a protein that does
not normally exist in nature. A recombinant sheep-hamster chimeric prion protein was
designed for use in a novel and promising assay called "real-time quaking-induced
conversion" (RT-QuIC), with which low levels of infectious prion can be detected in
human cerebrospinal fluid [13]. When the NCBInr database was used to confirm the
existence of the chimeric protein from a digested SDS-PAGE band, only one peptide,
representing prion protein from different species (i.e. neither sheep nor hamster),
was revealed (Table 1, Additional file 6), while the actual proteins [hamster
(Mesocricetus auratus) and sheep (Ovis aries)] represented only the third and fourth
hits, respectively. To identify the chimeric protein accurately, a curated database
called "PrpSheep-Hamster" was created in which the protein is accurately annotated
(Table 1, Additional file 7). Indeed, database searches of MS data obtained from two
separate but identical in-gel digested protein bands demonstrated that higher
identification confidence and more sequence-specific peptide matches resulted from
the smaller, more focused database (Table 2). This case shows that the
characterization of proteins possessing rare tryptic enzyme digestion sites for MS
analysis may benefit from using smaller and hence more accurate databases.
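Why a chimeric construct needs its own database entry can be sketched as follows. The parent sequences, the junction position, and the function names below are invented for illustration and do not correspond to the actual sheep or hamster PrP sequences; the digest assumes full tryptic cleavage after K or R except before P.

```python
from typing import List, Set


def digest(sequence: str) -> List[str]:
    """Fully cleaved tryptic peptides (after K/R, not before P)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def make_chimera(n_term_parent: str, c_term_parent: str, junction: int) -> str:
    """First `junction` residues of one parent, remainder of the other."""
    if not 0 < junction < min(len(n_term_parent), len(c_term_parent)):
        raise ValueError("junction must fall inside both parent sequences")
    return n_term_parent[:junction] + c_term_parent[junction:]


def chimera_only_peptides(chimera: str, parents: List[str]) -> Set[str]:
    """Peptides of the chimera that occur in none of the parent digests."""
    parent_peptides = set()
    for parent in parents:
        parent_peptides.update(digest(parent))
    return set(digest(chimera)) - parent_peptides


# Invented toy parents that differ at a few positions.
parent_1 = "MTAKLSDEGVVNQAEK"
parent_2 = "MTAKLTDEGVINQSEK"
chimera = make_chimera(parent_1, parent_2, junction=8)
print(chimera, chimera_only_peptides(chimera, [parent_1, parent_2]))
```

A peptide spanning the junction exists in neither parent, so a database containing only the natural sequences cannot match it; adding the chimeric sequence as an explicit entry makes such peptides identifiable.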

The fourth project highlights the ability of MS combined with a curated protein
database to supplement traditional E. coli flagellar serotyping. As there are 53
flagellar serotypes in E. coli, serotyping by way of antigen-antibody agglutination
reactions is a costly and tedious process [14,15]. In response, a unique method was
developed to enrich flagella for high-quality MS detection and identification [15],
but problems arose because specific H types (i.e. serotypes) could not be obtained
when the resulting MS data were searched against the NCBInr database. For the
flagellar serotype H37, for example, a search of NCBInr listed the sequence simply
as "flagellin" (Table 1, Additional file 8). To solve this problem, a curated E. coli
flagellar database representing all serotypes was created as a FASTA file, using
sequence data obtained from the public NCBInr database. The custom database was used
to successfully identify all examined flagellar H types from reference E. coli
strains [15] (Table 1 and Additional file 9 show one example, H37). Searches using
only the curated database, rather than the curated database in conjunction with the
public database Swiss-Prot, also produced a larger number of matched peptides with
higher confidence scores, and often attained better coverage with shorter search
times (Table 3). Lastly, MS sequence searches against the curated database and the
public databases Swiss-Prot and NCBInr demonstrated that only the smaller, more
focused curated database was able to provide accurate top-hit information with 100%
sensitivity and specificity (Table 4).
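The sensitivity and specificity figures can be reproduced from the top-hit calls with a one-versus-rest calculation per serotype. The sketch below uses the confirmed serotypes and curated-database top hits listed in Table 4; treating a generic "flagellin" top hit as "no H type called" (`None`) is an assumption made here for illustration, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def sensitivity_specificity(
    confirmed: List[str], called: List[Optional[str]], serotype: str
) -> Tuple[float, float]:
    """One-versus-rest sensitivity and specificity for a single serotype."""
    pairs = list(zip(confirmed, called))
    tp = sum(1 for c, k in pairs if c == serotype and k == serotype)
    fn = sum(1 for c, k in pairs if c == serotype and k != serotype)
    tn = sum(1 for c, k in pairs if c != serotype and k != serotype)
    fp = sum(1 for c, k in pairs if c != serotype and k == serotype)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Confirmed serotypes of the 15 strains in Table 4, in table order.
CONFIRMED = ["H1", "H2", "H3", "H4", "H5", "H6", "H7", "H8", "H9", "H10",
             "H7", "H7", "H7", "H7", "H7"]
# Curated-database top hits from Table 4 (identical to the confirmed types).
CD_CALLS = list(CONFIRMED)
# Assumption: a top hit named only "flagellin" yields no H-type call.
UNTYPED_CALLS: List[Optional[str]] = [None] * len(CONFIRMED)

print(sensitivity_specificity(CONFIRMED, CD_CALLS, "H7"))
print(sensitivity_specificity(CONFIRMED, UNTYPED_CALLS, "H7"))
```

With an untyped top hit, specificity remains perfect because nothing is called, but sensitivity drops to zero, which is why a correct protein family alone is not sufficient for H typing.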

Conclusions 

With the growing comprehensiveness of many species' genomes and the maturity of
MS-based technology, biomarker application and validation are increasingly used in
disease diagnosis and in the improvement of conventional bioassay methods. From the
above cases, it is evident that curated databases are very useful for accurate,
specific, and consistent identification and confirmation of proteins and biomarkers
of interest. Moreover, clearly annotated, fit-for-purpose databases prove extremely
useful for high-quality and standardized method


Table 2 Search output produced by searching sheep-hamster PrP MS sequence data against a curated prion protein
database (CD) alone and in conjunction with the public database, Swiss-Prot

Sample                           CD only                             CD and Swiss-Prot
                                 Mascot score  Peptide identified*   Mascot score  Peptide identified*
SDS-PAGE gel band (replicate 1)  4117          12 (11)               2232          12 (10)
SDS-PAGE gel band (replicate 2)  2734          10 (8)                1540          10 (7)

* Numbers without brackets denote total specific peptide match numbers, while numbers in brackets denote significant specific peptide match numbers as per
the Mascot search engine.



Cheng et al. BMC Research Notes 2014, 7:444
http://www.biomedcentral.com/1756-0500/7/444



Table 3 Search output produced by searching E. coli flagellin MS sequence data against a curated E. coli flagellin
database (CD) alone and in conjunction with the public database, Swiss-Prot

Strain    Confirmed   MS-H    Mascot score                  Sequence identified*          Sequence coverage (%)
number    serotype    type    CD only   CD and Swiss-Prot   CD only   CD and Swiss-Prot   CD only   CD and Swiss-Prot
E169      H1          H1      14607     10922               57 (55)   57 (49)             98        98
E170      H2          H2      1754      1113                37 (34)   37 (27)             80        80
E171      H3          H3      8117      5735                52 (46)   50 (39)             91        90
E172      H4          H4      3894      2893                28 (26)   28 (21)             89        89
E173      H5          H5      1568      1167                26 (23)   24 (16)             81        74
E174      H6          H6      6123      4513                46 (44)   46 (38)             90        90
EDL933    H7          H7      6131      4511                56 (54)   55 (48)             90        90
E176      H8          H8      5538      3916                44 (43)   43 (39)             90        89
E177      H9          H9      10426     8099                53 (51)   52 (47)             80        80
E659      H10         H10     7281      5042                47 (47)   47 (41)             98        98
902380    H7          H7      3421      2515                43 (40)   42 (35)             84        82
050958    H7          H7      2656      1999                38 (36)   38 (31)             78        78
090414    H7          H7      5223      3943                46 (44)   45 (42)             94        94
091349    H7          H7      5887      4459                52 (49)   52 (46)             94        94
091350    H7          H7      3404      2522                44 (42)   43 (37)             89        88

* Numbers without brackets denote total specific peptide match numbers, while numbers in brackets denote significant specific peptide match numbers as per
the Mascot search engine.
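The effect of adding a public database on the Mascot score can be quantified directly from the values in Table 3. The following sketch computes the relative score decrease per strain; the scores are transcribed from the table, while the variable and function names are hypothetical.

```python
# Mascot scores from Table 3: strain -> (CD only, CD and Swiss-Prot).
TABLE3_SCORES = {
    "E169": (14607, 10922), "E170": (1754, 1113), "E171": (8117, 5735),
    "E172": (3894, 2893), "E173": (1568, 1167), "E174": (6123, 4513),
    "EDL933": (6131, 4511), "E176": (5538, 3916), "E177": (10426, 8099),
    "E659": (7281, 5042), "902380": (3421, 2515), "050958": (2656, 1999),
    "090414": (5223, 3943), "091349": (5887, 4459), "091350": (3404, 2522),
}


def percent_decrease(cd_only: float, combined: float) -> float:
    """Relative drop in score (%) when the public database is added."""
    return 100.0 * (cd_only - combined) / cd_only


decreases = {
    strain: percent_decrease(cd_only, combined)
    for strain, (cd_only, combined) in TABLE3_SCORES.items()
}
for strain, drop in decreases.items():
    print(f"{strain:8s} {drop:5.1f} %")
print(f"mean     {sum(decreases.values()) / len(decreases):5.1f} %")
```

For every strain listed, the combined search scores lower than the curated-only search, in line with the observation that the curated database alone gives higher confidence.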












development and validation using MS-based technology. Due to the sophistication of
MS instrumentation and specific software requirements, together with variations in
protein expression and posttranslational modifications, detection of analogous
proteins through MS remains complicated. This paper will hopefully serve as an
example and reminder for all MS users, especially those performing specific and/or
"non-mainstream" research and applications, recombinant DNA technology quality
control, and targeted biomarker identification and validation, to use curated
fit-for-purpose databases in order to consistently and accurately identify MS data.




Table 4 Top hits produced by searching E. coli flagellin MS data against a curated E. coli flagellin database (CD) and
the public databases, Swiss-Prot and NCBInr*

Strain number  Confirmed serotype  CD (195 sequences)  Swiss-Prot (331,337 sequences)  NCBInr (25,303,445 sequences)
                                   top hit             top hit                         top hit
E169           H1                  H1                  Shigella flagellin              flagellin [E. coli]
E170           H2                  H2                  E. coli Elongation factor       flagellin [E. coli]
E171           H3                  H3                  Salmonella flagellin            flagellin [E. coli]
E172           H4                  H4                  E. coli K12 flagellin           flagellin [E. coli]
E173           H5                  H5                  E. coli K12 flagellin           E. coli flagellar protein FliC
E174           H6                  H6                  Shigella flagellin              FliC [E. coli]
EDL933         H7                  H7                  Shigella flagellin              flagellin [E. coli]
E176           H8                  H8                  Shigella flagellin              flagellin [E. coli]
E177           H9                  H9                  Shigella flagellin              flagellin [E. coli]
E659           H10                 H10                 E. coli K12 flagellin           flagellin [E. coli]
902380         H7                  H7                  Shigella flagellin              flagellin [E. coli]
050958         H7                  H7                  Shigella flagellin              flagellin [E. coli]
090414         H7                  H7                  Shigella flagellin              flagellin [E. coli]
091349         H7                  H7                  Shigella flagellin              flagellin [E. coli]
091350         H7                  H7                  Shigella flagellin              flagellin [E. coli]

* An Orbitrap system was used with 30 ppm peptide mass tolerance, 0.5 Da MS/MS tolerance, and one missed tryptic cleavage for all database searches. Oxidation on
methionine and deamidation on glutamine and asparagine were chosen as possible modifications.
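The peptide mass tolerance quoted in the search settings is relative (ppm), whereas the MS/MS tolerance is absolute. As a minimal sketch of how a ppm tolerance translates into a mass window, the following helper functions (hypothetical names, illustrative m/z values) convert a tolerance in ppm into lower and upper bounds and test whether an observed value falls inside them.

```python
from typing import Tuple


def ppm_window(theoretical_mz: float, ppm: float) -> Tuple[float, float]:
    """Lower and upper m/z bounds for a relative tolerance given in ppm."""
    delta = theoretical_mz * ppm / 1e6
    return theoretical_mz - delta, theoretical_mz + delta


def within_ppm(observed_mz: float, theoretical_mz: float, ppm: float) -> bool:
    """True if the observed value lies within the ppm tolerance."""
    return abs(observed_mz - theoretical_mz) <= theoretical_mz * ppm / 1e6


# Illustrative values: 30 ppm at m/z 1000 corresponds to +/- 0.03.
low, high = ppm_window(1000.0, 30.0)
print(f"{low:.2f} to {high:.2f}")
print(within_ppm(1000.02, 1000.0, 30.0), within_ppm(1000.04, 1000.0, 30.0))
```

A narrower precursor window admits fewer candidate peptides per spectrum, so instrument accuracy and database size both shape the effective search space.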
Availability of supporting data

All the databases are available in Additional file 1-Database.zip. Any questions
regarding the application of the databases should be addressed to K. C.
(Keding.Cheng@phac-aspc.gc.ca).

Additional files 
LC-MS/MS: Liquid-chromatography tandem mass spectrometry; MS: Mass
spectrometry; MSDB: Mass spectrometry database; NCBInr: National Centre of
Resource Centre; PrP: Prion protein; Pup: Prokaryotic ubiquitin-like protein;